34 research outputs found
Shortest Unique Substring Queries on Run-Length Encoded Strings
We consider the problem of answering shortest unique substring (SUS) queries on run-length encoded strings. For a string S, a unique substring u = S[i..j] is said to be a shortest unique substring (SUS) of S containing an interval [s, t] (i <= s <= t <= j) if for any i' <= s <= t <= j' with j-i > j'-i', S[i'..j'] occurs at least twice in S.
Given a run-length encoding of size m of a string of length N, we show that we can construct a data structure of size O(m+pi_s(N, m)) in O(m log m + pi_c(N, m)) time such that queries can be answered in
O(pi_q(N, m) + k) time, where k is the size of the output (the number of SUSs), and pi_s(N,m), pi_c(N,m), pi_q(N,m) are, respectively, the size, construction time, and query time for a predecessor/successor query data structure of m elements for the universe of [1,N]. Using the data structure by Beame and Fich (JCSS 2002), this results in a data structure of O(m) space that is constructed in O(m log m) time, and answers queries in O(sqrt(log m/loglog m)+k) time.
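To make the role of the predecessor/successor structure concrete, here is a minimal sketch that run-length encodes a string and locates the run containing a text position with a predecessor query over the run start positions. The function names are illustrative, not taken from the paper, and the plain binary search used here costs O(log m) per query; it only stands in for the faster Beame and Fich structure.

```python
from bisect import bisect_right
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def run_starts(runs):
    """Return the 1-indexed start position of every run."""
    starts, pos = [], 1
    for _, length in runs:
        starts.append(pos)
        pos += length
    return starts


def run_containing(starts, p):
    """Predecessor query: index of the last run whose start is <= p.

    Assumes 1 <= p <= N. Binary search is a stand-in for any
    predecessor/successor structure over the universe [1, N].
    """
    return bisect_right(starts, p) - 1


runs = run_length_encode("aaabbc")   # m = 3 runs for N = 6
starts = run_starts(runs)
print(runs, starts, run_containing(starts, 5))
```

Only the m run boundaries are stored, so the space depends on m rather than on N.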
Tight Bounds on the Maximum Number of Shortest Unique Substrings
A substring Q of a string S is called a shortest unique substring (SUS) for interval [s,t] in S, if Q occurs exactly once in S, this occurrence of Q contains interval [s,t], and every substring of S which contains interval [s,t] and is shorter than Q occurs at least twice in S. The SUS problem is, given a string S, to preprocess S so that for any subsequent query interval [s,t] all the SUSs for interval [s,t] can be reported quickly. When s = t, we call the SUSs for [s, t] point SUSs, and when s <= t, we call the SUSs for [s, t] interval SUSs. There exists an optimal O(n)-time preprocessing scheme which answers queries in optimal O(k) time for both point and interval SUSs, where n is the length of S and k is the number of outputs for a given query. In this paper, we reveal structural, combinatorial properties underlying the SUS problem: namely, we show that the number of intervals in S that correspond to point SUSs for all query positions in S is less than 1.5n, and that this bound is tight, with matching upper and lower bounds. Also, we consider the maximum number of intervals in S that correspond to interval SUSs for all query intervals in S
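The SUS definition above can be checked directly with a brute-force sketch. It tries candidate lengths in increasing order and returns every unique substring of the first length that works. This is only an illustration of the definition, far slower than the O(n)-time preprocessing mentioned in the abstract; the function names are hypothetical.

```python
def count_occurrences(s, u):
    """Count (possibly overlapping) occurrences of u in s."""
    return sum(1 for i in range(len(s) - len(u) + 1) if s.startswith(u, i))


def sus_for_interval(s, lo, hi):
    """Return all SUSs for the interval [lo, hi] as (i, j) pairs.

    Positions are 1-indexed and inclusive, as in the abstract.
    Assumes 1 <= lo <= hi <= len(s).
    """
    n = len(s)
    for length in range(hi - lo + 1, n + 1):
        found = []
        # The occurrence S[i..j] must satisfy i <= lo and j >= hi.
        first = max(1, hi - length + 1)
        last = min(lo, n - length + 1)
        for i in range(first, last + 1):
            j = i + length - 1
            if count_occurrences(s, s[i - 1:j]) == 1:
                found.append((i, j))
        if found:
            return found  # shortest length reached; report all of them
    return []


print(sus_for_interval("aab", 2, 2))   # point query at position 2
print(sus_for_interval("abab", 2, 3))  # interval query [2, 3]
```

For "aab" and the point query at position 2, both "aa" and "ab" are unique and contain the position, so a single query can have more than one answer.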
Sliding suffix trees simplified
Sliding suffix trees (Fiala & Greene, 1989) for an input text T over an
alphabet of size σ and a sliding window W of T can be maintained in
O(|T| log σ) time and O(|W|) space. The two previous approaches that
achieve this can be categorized into the credit-based approach of Fiala and
Greene (1989) and Larsson (1996, 1999), or the batch-based approach proposed by
Senft (2005). Brodnik and Jekovec (2018) showed that the sliding suffix tree
can be supplemented with leaf pointers in order to find all occurrences of an
online query pattern in the current window, and that leaf pointers can be
maintained by credit-based arguments as well. The main difficulty in the
credit-based approach is in the maintenance of index-pairs that represent each
edge. In this paper, we show that valid edge index-pairs can be derived in
constant time from leaf pointers, thus reducing the maintenance of edge
index-pairs to the maintenance of leaf pointers. We further propose a new
simple method which maintains leaf pointers without using credit-based
arguments. Our algorithm and proof of correctness are much simpler compared to
the credit-based approach, whose analyses were initially flawed (Senft 2005).
Finding Top-k Longest Palindromes in Substrings
Palindromes are strings that read the same forward and backward. Problems of
computing palindromic structures in strings have been studied for many years
with a motivation of their application to biology. The longest palindrome
problem is one of the most important and classical problems regarding
palindromic structures, that is, to compute the longest palindrome appearing in
a string T of length n. The problem can be solved in O(n) time by the
famous algorithm of Manacher [Journal of the ACM, 1975]. This paper generalizes
the longest palindrome problem to the problem of finding top-k longest
palindromes in an arbitrary substring, including the input string T itself.
The internal top-k longest palindrome query is, given a substring T[i..j]
of T and a positive integer k as a query, to compute the top-k longest
palindromes appearing in T[i..j]. This paper proposes a linear-size data
structure that can answer internal top-k longest palindrome queries in optimal O(k)
time. Also, given the input string T, our data structure can be
constructed in O(n log n) time. For k = 1, the construction time is reduced
to O(n).
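As a rough illustration of the query, the sketch below runs Manacher's algorithm once and then answers a top-k query on a substring naively, by clipping each maximal palindrome to the query window and sorting. Two points are assumptions rather than the paper's definitions: each center contributes only its longest palindrome inside the window, and ties are broken by starting position. The query is 0-indexed and scans the whole window, so it does not achieve the optimal query time of the proposed data structure.

```python
def manacher(t):
    """Return radii r over the interleaved string '#t0#t1#...#'.

    r[c] equals the length (in t) of the maximal palindrome centered at c.
    """
    s = "#" + "#".join(t) + "#"
    n = len(s)
    r = [0] * n
    center = right = 0
    for i in range(n):
        if i < right:
            # Reuse the radius of the mirror position inside the known palindrome.
            r[i] = min(right - i, r[2 * center - i])
        while (i - r[i] - 1 >= 0 and i + r[i] + 1 < n
               and s[i - r[i] - 1] == s[i + r[i] + 1]):
            r[i] += 1
        if i + r[i] > right:
            center, right = i, i + r[i]
    return r


def top_k_palindromes(t, i, j, k):
    """Naive top-k longest palindromes in t[i..j] (0-indexed, inclusive)."""
    r = manacher(t)
    lo, hi = 2 * i, 2 * j + 2          # window borders in the interleaved string
    candidates = []
    for c in range(lo, hi + 1):
        rad = min(r[c], c - lo, hi - c)  # clip the palindrome to the window
        if rad > 0:
            candidates.append((rad, (c - rad) // 2))  # (length, start in t)
    candidates.sort(key=lambda x: (-x[0], x[1]))
    return [t[start:start + length] for length, start in candidates[:k]]


print(top_k_palindromes("abacaba", 0, 6, 3))
print(top_k_palindromes("abacaba", 1, 5, 2))
```

Clipping is needed because a palindrome that is maximal in the whole string may extend past the query window, as happens for the window covering "bacab" in the second call.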
String Sanitization Under Edit Distance: Improved and Generalized
Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_ED such that: (i) no string of S occurs in X_ED; (ii) the order of all other length-k substrings over Σ is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n^2 k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^(2−δ)) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
• An O(n^2 log^2 k)-time algorithm to solve ETFS.
• An O(n^2 log^2 n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths
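The three ETFS conditions can be made concrete with a small checker. This is a sketch under one reading of conditions (i) and (ii), not the algorithm from the paper: it assumes that the candidate string may use a separator letter outside Σ (written '#' here, an assumption of this example) and that "order" means the sequence of non-sensitive length-k substrings. Condition (iii) is represented only by a standard quadratic edit-distance routine for comparing candidates. All function names are hypothetical.

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def kmers(s, k):
    """All length-k substrings of s, in order of appearance."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_feasible(w, x, k, sensitive, alphabet):
    """Check conditions (i) and (ii) for a candidate x, as read above."""
    x_kmers = kmers(x, k)
    if any(u in sensitive for u in x_kmers):   # (i) no sensitive pattern occurs
        return False
    keep_w = [u for u in kmers(w, k) if u not in sensitive]
    keep_x = [u for u in x_kmers if set(u) <= alphabet]
    return keep_w == keep_x                    # (ii) same order of the rest


w, sensitive, alphabet = "abcab", {"ca"}, set("abc")
for x in ["abc#ab", "abcab", "abcb"]:
    print(x, is_feasible(w, x, 2, sensitive, alphabet), edit_distance(w, x))
```

In this toy instance "abc#ab" is feasible at edit distance 1, while "abcab" still contains the sensitive pattern and "abcb" changes the sequence of the remaining length-2 substrings.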